{
 "nbformat": 4,
 "nbformat_minor": 0,
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)"
   ],
   "metadata": {
    "id": "Wf4-YfQC2EdS"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/ocr/ocr_visual_document_classifier.ipynb)"
   ],
   "metadata": {
    "id": "UvR9k1032Jcu"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "# **VisualDocumentClassifier**\n",
    "\n",
    "\n",
    "The **VisualDocumentClassifier** is a DL model for document classification using text and layout data. The currently available pre-trained model on the Tobacco3482 dataset contains 3482 images belonging to 10 different classes (Resume, News, Note, Advertisement, Scientific, Report, Form, Letter, Email and Memo)\n",
    "\n",
    "**All the available models:**\n",
    "\n",
    "| language | nlu.load() reference      | Spark NLP Model Reference              |\n",
    "|----------|---------------------------|----------------------------------------|\n",
    "| en       | en.classify_image.tabacco | visual_document_classifier_tobacco3482 |"
   ],
   "metadata": {
    "id": "JJ6YV5Qaz3Mc"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## **Starting the session**"
   ],
   "metadata": {
    "id": "iSYNEpL02oh_"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "_KRf7uGFz1Fv"
   },
   "outputs": [],
   "source": [
    "from johnsnowlabs import nlp\n",
    "nlp.install(visual=True)\n",
    "nlp.start(visual=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "## **Visual Document Classifier**"
   ],
   "metadata": {
    "id": "9PqdTFpK5Pmq"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Parsed Nlu_ref=en.classify_image.tabacco as lang=en\n",
      "21:23:18, INFO Parsed Nlu_ref=en.classify_image.tabacco as lang=en\n",
      "Parsed Nlu_ref=en.classify_image.tabacco as lang=en\n",
      "21:23:18, INFO Parsed Nlu_ref=en.classify_image.tabacco as lang=en\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warning::Spark Session already created, some configs may not take.\n",
      "Warning::Spark Session already created, some configs may not take.\n",
      "visual_document_classifier_tobacco3482 download started this may take some time.\n",
      "Approximate size to download 398.1 MB\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Adding visual_document_classifier to internal component_list\n",
      "21:23:50, INFO Adding visual_document_classifier to internal component_list\n",
      "Satisfying dependencies\n",
      "21:23:50, INFO Satisfying dependencies\n",
      "========================================================================\n",
      "21:23:50, INFO ========================================================================\n",
      "Resolution Status provided_features_no_ref = {'visual_classifier_confidence', 'visual_classifier_prediction'}\n",
      "21:23:50, INFO Resolution Status provided_features_no_ref = {'visual_classifier_confidence', 'visual_classifier_prediction'}\n",
      "Resolution Status required_features_no_ref = {'hocr'}\n",
      "21:23:50, INFO Resolution Status required_features_no_ref = {'hocr'}\n",
      "Resolution Status provided_features_ref    = set()\n",
      "21:23:50, INFO Resolution Status provided_features_ref    = set()\n",
      "Resolution Status required_features_ref    = set()\n",
      "21:23:50, INFO Resolution Status required_features_ref    = set()\n",
      "Resolution Status is_trainable             = False\n",
      "21:23:50, INFO Resolution Status is_trainable             = False\n",
      "Resolution Status conversion_candidates    = []\n",
      "21:23:50, INFO Resolution Status conversion_candidates    = []\n",
      "Resolution Status missing_features_no_ref  = {'hocr'}\n",
      "21:23:50, INFO Resolution Status missing_features_no_ref  = {'hocr'}\n",
      "Resolution Status conversion_candidates    = set()\n",
      "21:23:50, INFO Resolution Status conversion_candidates    = set()\n",
      "========================================================================\n",
      "21:23:50, INFO ========================================================================\n",
      "Getting default for missing_feature_type=hocr\n",
      "21:23:50, INFO Getting default for missing_feature_type=hocr\n",
      "Resolved for missing components the following NLU components : [<nlu.pipe.nlu_component.NluComponent object at 0x0000021C89C01CD0>]\n",
      "21:23:50, INFO Resolved for missing components the following NLU components : [<nlu.pipe.nlu_component.NluComponent object at 0x0000021C89C01CD0>]\n",
      "adding image2hocr\n",
      "21:23:50, INFO adding image2hocr\n",
      "Adding image2hocr to internal component_list\n",
      "21:23:50, INFO Adding image2hocr to internal component_list\n",
      "========================================================================\n",
      "21:23:50, INFO ========================================================================\n",
      "Resolution Status provided_features_no_ref = {'hocr', 'visual_classifier_confidence', 'visual_classifier_prediction'}\n",
      "21:23:50, INFO Resolution Status provided_features_no_ref = {'hocr', 'visual_classifier_confidence', 'visual_classifier_prediction'}\n",
      "Resolution Status required_features_no_ref = {'hocr', 'ocr_image'}\n",
      "21:23:50, INFO Resolution Status required_features_no_ref = {'hocr', 'ocr_image'}\n",
      "Resolution Status provided_features_ref    = set()\n",
      "21:23:50, INFO Resolution Status provided_features_ref    = set()\n",
      "Resolution Status required_features_ref    = set()\n",
      "21:23:50, INFO Resolution Status required_features_ref    = set()\n",
      "Resolution Status is_trainable             = False\n",
      "21:23:50, INFO Resolution Status is_trainable             = False\n",
      "Resolution Status conversion_candidates    = []\n",
      "21:23:50, INFO Resolution Status conversion_candidates    = []\n",
      "Resolution Status missing_features_no_ref  = {'ocr_image'}\n",
      "21:23:50, INFO Resolution Status missing_features_no_ref  = {'ocr_image'}\n",
      "Resolution Status conversion_candidates    = set()\n",
      "21:23:50, INFO Resolution Status conversion_candidates    = set()\n",
      "========================================================================\n",
      "21:23:50, INFO ========================================================================\n",
      "Getting default for missing_feature_type=ocr_image\n",
      "21:23:50, INFO Getting default for missing_feature_type=ocr_image\n",
      "Resolved for missing components the following NLU components : [<nlu.pipe.nlu_component.NluComponent object at 0x0000021CC2007DF0>]\n",
      "21:23:50, INFO Resolved for missing components the following NLU components : [<nlu.pipe.nlu_component.NluComponent object at 0x0000021CC2007DF0>]\n",
      "adding binary2image\n",
      "21:23:50, INFO adding binary2image\n",
      "Adding binary2image to internal component_list\n",
      "21:23:50, INFO Adding binary2image to internal component_list\n",
      "========================================================================\n",
      "21:23:50, INFO ========================================================================\n",
      "Resolution Status provided_features_no_ref = {'hocr', 'visual_classifier_confidence', 'ocr_image', 'visual_classifier_prediction'}\n",
      "21:23:50, INFO Resolution Status provided_features_no_ref = {'hocr', 'visual_classifier_confidence', 'ocr_image', 'visual_classifier_prediction'}\n",
      "Resolution Status required_features_no_ref = {'hocr', 'ocr_image'}\n",
      "21:23:50, INFO Resolution Status required_features_no_ref = {'hocr', 'ocr_image'}\n",
      "Resolution Status provided_features_ref    = set()\n",
      "21:23:50, INFO Resolution Status provided_features_ref    = set()\n",
      "Resolution Status required_features_ref    = set()\n",
      "21:23:50, INFO Resolution Status required_features_ref    = set()\n",
      "Resolution Status is_trainable             = False\n",
      "21:23:50, INFO Resolution Status is_trainable             = False\n",
      "Resolution Status conversion_candidates    = []\n",
      "21:23:50, INFO Resolution Status conversion_candidates    = []\n",
      "Resolution Status missing_features_no_ref  = set()\n",
      "21:23:50, INFO Resolution Status missing_features_no_ref  = set()\n",
      "Resolution Status conversion_candidates    = set()\n",
      "21:23:50, INFO Resolution Status conversion_candidates    = set()\n",
      "========================================================================\n",
      "21:23:50, INFO ========================================================================\n",
      "!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!* ALL DEPENDENCIES SATISFIED !*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*\n",
      "21:23:50, INFO !*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!* ALL DEPENDENCIES SATISFIED !*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*\n",
      "Fixing column names\n",
      "21:23:50, INFO Fixing column names\n",
      "Fixing input and output column names\n",
      "21:23:50, INFO Fixing input and output column names\n",
      "Checking for component_to_resolve visual_document_classifier wether inputs {'hocr'} is satisfied by another component_to_resolve in the component_list \n",
      "21:23:50, INFO Checking for component_to_resolve visual_document_classifier wether inputs {'hocr'} is satisfied by another component_to_resolve in the component_list \n",
      "Checking for component_to_resolve image2hocr wether inputs {'ocr_image'} is satisfied by another component_to_resolve in the component_list \n",
      "21:23:50, INFO Checking for component_to_resolve image2hocr wether inputs {'ocr_image'} is satisfied by another component_to_resolve in the component_list \n",
      "Checking for component_to_resolve binary2image wether inputs {'content', 'path'} is satisfied by another component_to_resolve in the component_list \n",
      "21:23:50, INFO Checking for component_to_resolve binary2image wether inputs {'content', 'path'} is satisfied by another component_to_resolve in the component_list \n",
      "Optimizing component_list component_to_resolve order\n",
      "21:23:50, INFO Optimizing component_list component_to_resolve order\n",
      "Starting to optimize component_to_resolve order \n",
      "21:23:50, INFO Starting to optimize component_to_resolve order \n",
      "Optimizing order for component_to_resolve visual_document_classifier\n",
      "21:23:50, INFO Optimizing order for component_to_resolve visual_document_classifier\n",
      "Optimizing order for component_to_resolve image2hocr\n",
      "21:23:50, INFO Optimizing order for component_to_resolve image2hocr\n",
      "Optimizing order for component_to_resolve binary2image\n",
      "21:23:50, INFO Optimizing order for component_to_resolve binary2image\n",
      "Optimizing order for component_to_resolve visual_document_classifier\n",
      "21:23:50, INFO Optimizing order for component_to_resolve visual_document_classifier\n",
      "Optimizing order for component_to_resolve image2hocr\n",
      "21:23:50, INFO Optimizing order for component_to_resolve image2hocr\n",
      "Optimizing order for component_to_resolve visual_document_classifier\n",
      "21:23:50, INFO Optimizing order for component_to_resolve visual_document_classifier\n",
      "Optimizing order for component_to_resolve image2hocr\n",
      "21:23:50, INFO Optimizing order for component_to_resolve image2hocr\n",
      "Optimizing order for component_to_resolve visual_document_classifier\n",
      "21:23:50, INFO Optimizing order for component_to_resolve visual_document_classifier\n",
      "Optimizing order for component_to_resolve visual_document_classifier\n",
      "21:23:50, INFO Optimizing order for component_to_resolve visual_document_classifier\n",
      "Renaming duplicates cols\n",
      "21:23:50, INFO Renaming duplicates cols\n",
      "Done with component_list optimizing\n",
      "21:23:50, INFO Done with component_list optimizing\n",
      "Fitting on empty Dataframe, could not infer correct training method. This is intended for non-trainable pipelines.\n",
      "21:23:50, INFO Fitting on empty Dataframe, could not infer correct training method. This is intended for non-trainable pipelines.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warning::Spark Session already created, some configs may not take.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Configuring Light Pipeline Usage\n",
      "21:23:52, INFO Configuring Light Pipeline Usage\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warning::Spark Session already created, some configs may not take.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Inferred and set output level of pipeline to document\n",
      "21:23:53, INFO Inferred and set output level of pipeline to document\n",
      "Extracting for same_level_cols = ['text']\n",
      "\n",
      "21:24:07, INFO Extracting for same_level_cols = ['text']\n",
      "\n"
     ]
    }
   ],
   "source": [
    "p = nlp.load('en.classify_image.tabacco',verbose=True)\n",
    "res = p.predict('cv_test.png')"
   ],
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-01-02T15:54:07.633782200Z",
     "start_time": "2024-01-02T15:53:18.012906100Z"
    },
    "id": "IS6wleBN4ynd",
    "outputId": "6e8ebc43-92c3-4821-a495-6c515962152e"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [
    {
     "data": {
      "text/plain": "                                           file_path  \\\n0  file:/F:/Work/repos/nlu/tests/nlu_ocr_tests/cv...   \n\n   visual_classifier_confidence visual_classifier_prediction  \n0                      0.990776                       Resume  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>file_path</th>\n      <th>visual_classifier_confidence</th>\n      <th>visual_classifier_prediction</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>file:/F:/Work/repos/nlu/tests/nlu_ocr_tests/cv...</td>\n      <td>0.990776</td>\n      <td>Resume</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res"
   ],
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-01-02T15:54:28.984492300Z",
     "start_time": "2024-01-02T15:54:28.925420500Z"
    },
    "id": "FxOSw_JS4ynd",
    "outputId": "392b617e-cb07-42e5-814f-81b3e8c8a0d6"
   }
  }
 ]
}
